
[Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc#7777

Open
ShaneGZhu wants to merge 7 commits into PaddlePaddle:develop from ShaneGZhu:get_moe_score

Conversation

@ShaneGZhu (Contributor) commented May 11, 2026

Motivation

Kernel fusion: cast + sigmoid + bias + noauxtc. Currently, this is supported only on CUDA devices.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

  • Added custom_ops/gpu_ops/grouped_topk_kernels.cu: implements grouped_topk_fused_kernel, which performs the cast, sigmoid, bias addition, and grouped top-k routing in a single kernel launch; supports float32/bfloat16/float16 inputs
  • custom_ops/gpu_ops/cpp_extensions.cc: added the grouped_topk function declaration and its pybind11 binding
  • custom_ops/setup_ops.py: added the new .cu file to both compilation source lists
  • fastdeploy/model_executor/layers/moe/moe.py: get_moe_scores takes the new grouped_topk path when use_fused=True, replacing the previous fused_cast_sigmoid_bias + noaux_tc two-kernel call
  • tests/operators/test_grouped_topk_op.py: added correctness and numerical-alignment tests covering four model configurations: DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2
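To make the fused semantics concrete, here is a plain NumPy reference of what the four fused steps compute per token (a sketch only; `grouped_topk_reference` and its parameters are illustrative names, not the PR's API, and the group-scoring rule shown — sum of each group's top-2 experts, as in DeepSeek-V3-style noaux_tc routing — is an assumption about this kernel):

```python
import numpy as np

def grouped_topk_reference(logits, bias, n_group, topk_group, topk):
    """Reference semantics for cast + sigmoid + bias + grouped top-k.

    logits: [n_tokens, n_experts] gating output (any float dtype)
    bias:   [n_experts] e_score correction bias (float32)
    Returns (topk_weights, topk_ids) per token.
    """
    # step 1+2: cast to float32, then sigmoid
    scores = 1.0 / (1.0 + np.exp(-logits.astype(np.float32)))
    # step 3: add the correction bias (used only for expert selection)
    biased = scores + bias.astype(np.float32)
    n_tokens, n_experts = biased.shape
    group_size = n_experts // n_group
    grouped = biased.reshape(n_tokens, n_group, group_size)
    # step 4a: score each group by the sum of its top-2 experts
    group_scores = np.sort(grouped, axis=-1)[:, :, -2:].sum(axis=-1)
    # step 4b: keep only the best `topk_group` groups, mask out the rest
    keep = np.argsort(-group_scores, axis=-1)[:, :topk_group]
    mask = np.zeros((n_tokens, n_group), dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    masked = np.where(mask[:, :, None], grouped, -np.inf).reshape(n_tokens, n_experts)
    # step 4c: top-k experts among the surviving groups
    topk_ids = np.argsort(-masked, axis=-1)[:, :topk]
    # routing weights come from the un-biased sigmoid scores
    topk_weights = np.take_along_axis(scores, topk_ids, axis=-1)
    return topk_weights, topk_ids
```

The point of the fusion is that the kernel produces `topk_weights`/`topk_ids` directly from the raw gating logits, so the intermediate `scores` and `biased` tensors never round-trip through global memory.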

Usage or Command

N/A

Accuracy Tests

| Branch | Parallelism | Model | Comparison | Requests | Engine max BS | Avg input | Avg output | TPS | OTPS | QPS | TTFT (ms) | Decode speed (tok/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| develop | TP8 | GLM4.5-Air | baseline | 256 | 256 | 159.53 | 5433.42 | 3263.46 | 3170.38 | 0.583 | 1257.81 | 30.38 |
| develop | TP8 | GLM4.5-Air | fused_cast | 256 | 256 | 159.53 | 5662.12 | 3330.47 (+2.06%) | 3239.20 | 0.572 | 1282.50 | 30.70 (+1%) |
| develop | TP8 | GLM4.5-Air | fused_cast_get_moe_score | 256 | 256 | 159.53 | 5604.22 | 3392.78 (+3.95%) | 3298.88 | 0.589 (+1%) | 1458.27 | 30.54 (+0.5%) |

fused_cast+noaux (A) vs fused_cast_grouped_topk (C) performance comparison

| config | T (tokens) | E (experts) | path_a (µs) | path_c (µs) | speedup (a/c) | max idx diff |
|---|---|---|---|---|---|---|
| deepseek_v3 | 1 | 256 | 24.23 | 10.18 | 2.38x | 0.00e+00 |
| deepseek_v3 | 8 | 256 | 24.55 | 11.66 | 2.11x | 0.00e+00 |
| deepseek_v3 | 32 | 256 | 24.28 | 11.85 | 2.05x | 0.00e+00 |
| deepseek_v3 | 128 | 256 | 24.37 | 11.87 | 2.05x | 0.00e+00 |
| deepseek_v3 | 256 | 256 | 24.06 | 12.02 | 2.00x | 0.00e+00 |
| deepseek_v3 | 512 | 256 | 23.91 | 12.27 | 1.95x | 0.00e+00 |
| deepseek_v3 | 1024 | 256 | 24.16 | 20.73 | 1.17x | 0.00e+00 |
| deepseek_v3 | 2048 | 256 | 26.77 | 31.05 | 0.86x | 0.00e+00 |
| deepseek_v3 | 4096 | 256 | 35.95 | 48.19 | 0.75x | 0.00e+00 |
| deepseek_v3 | 8192 | 256 | 60.40 | 77.83 | 0.78x | 0.00e+00 |
| glm45_air | 1 | 128 | 24.08 | 9.67 | 2.49x | 0.00e+00 |
| glm45_air | 8 | 128 | 23.89 | 9.79 | 2.44x | 0.00e+00 |
| glm45_air | 32 | 128 | 31.09 | 11.43 | 2.72x | 0.00e+00 |
| glm45_air | 128 | 128 | 24.34 | 11.45 | 2.13x | 0.00e+00 |
| glm45_air | 256 | 128 | 24.58 | 11.45 | 2.15x | 0.00e+00 |
| glm45_air | 512 | 128 | 24.54 | 11.56 | 2.12x | 0.00e+00 |
| glm45_air | 1024 | 128 | 24.55 | 11.94 | 2.06x | 0.00e+00 |
| glm45_air | 2048 | 128 | 24.54 | 13.05 | 1.88x | 0.00e+00 |
| glm45_air | 4096 | 128 | 26.33 | 17.21 | 1.53x | 0.00e+00 |
| glm45_air | 8192 | 128 | 40.93 | 29.50 | 1.39x | 0.00e+00 |
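Microbenchmark numbers like these are usually gathered with a warmup-then-repeat loop that reports a robust statistic such as the median. A generic sketch of that harness shape (pure Python/NumPy for illustration; the PR's actual harness times CUDA kernels, which would additionally need device synchronization around the timer):

```python
import time
import numpy as np

def bench(fn, warmup=10, iters=100):
    """Median wall-clock time of fn() over `iters` runs, after warmup.

    Warmup absorbs one-time costs (JIT, caches); the median resists
    outliers from OS scheduling noise better than the mean.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2]

# Illustrative comparison of two "paths" computing the same result:
x = np.random.default_rng(0).standard_normal(1 << 15).astype(np.float32)
t_a = bench(lambda: float(np.sum(x)))      # vectorized path
t_b = bench(lambda: sum(x.tolist()))       # scalar fallback path
print(f"speedup (b/a): {t_b / t_a:.1f}x")
```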

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please state the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot paddle-bot (Bot) commented May 11, 2026

Thanks for your contribution!

PaddlePaddle-bot (comment marked as outdated)

@PaddlePaddle-bot PaddlePaddle-bot commented May 11, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-13 18:19:37

The CI report is generated from the code below (refreshed every 30 minutes):


1 Task overview

⚠️ Currently 1 Required task is failing (the Approval check has not passed) and another 7 Required tasks are running; please resolve the approval issue, then wait for the remaining tasks to finish.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 38 (0) | 38 | 26 | 1 | 9 | 2 | 0 |

2 Task status summary

2.1 Required tasks: 2/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Approval | 8s | PR issue: missing custom-op approvals from an FD RD and a PaddlePaddle RD | Contact an FD RD / PaddlePaddle RD for review approval | Job | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | Run Base Tests / base_tests | - | Running | - | Job | - |
| ⏳ | Run Stable Tests / stable_tests | - | Running | - | Job | - |
| ⏳ | Run Four Cards Tests / run_4_cards_tests | - | Running | - | Job | - |
| ⏳ | Extracted partial CE model tasks / run_ce_cases | - | Running | - | Job | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| ⏳ | xpu_8cards_case_test / run_xpu_8cards_cases | - | Running | - | Job | - |

The remaining 2 required tasks passed (Pre Commit, run_tests_logprob).

2.2 Optional tasks: 24/28 passed

Optional tasks do not block merging; their failures are informational only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ⏳ | xpu_unit_test / run_xpu_unit_test | - | Job | - |
| ⏳ | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| ⏸️ | CI_HPU | - | - | - |

The remaining 24 optional tasks passed.

3 Failure details (Required only)

Approval — process/approval issue (confidence: high)

Approval

  • Status: ❌ failed
  • Error type: process/approval issue
  • Confidence: high
  • Root-cause summary: the PR adds a custom op but is missing one review approval each from a FastDeploy RD and a PaddlePaddle RD
  • Analyzer: generic analysis (fallback)

Root-cause details:
The check_approval.sh script detected that the PR adds a custom op, which requires:

  1. A review approval from one FastDeploy RD (qingqing01/Jiang-Jia-Jun/heavengate)
  2. A review approval from one PaddlePaddle RD (jeff41404/yongqiangma)

Neither requirement is currently met; the script reports "There are 2 approved errors." and exits with code 6.

Key log:

0. You must have one FastDeploy RD (qingqing01, Jiang-Jia-Jun, heavengate) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404, yongqiangma) approval for adding custom op.

There are 2 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

  1. Contact any one of the following FastDeploy RDs for review approval: @dangqingqing / @jiangjiajun / @DENGKAIPENG
  2. Contact any one of the following PaddlePaddle RDs for review approval: @gaoxiang / @mayongqiang

Fix summary: obtain one review approval each from an FD RD and a PaddlePaddle RD

Link: view log

PaddlePaddle-bot (comment marked as outdated)

@ShaneGZhu ShaneGZhu marked this pull request as ready for review May 11, 2026 11:53
PaddlePaddle-bot (comment marked as outdated)

gongshaotian (Collaborator) previously approved these changes May 11, 2026

@gongshaotian left a comment:

LGTM

from fastdeploy.model_executor.layers.moe.fused_cast_sigmoid_bias import (
fused_cast_sigmoid_bias,
)
pass
A Collaborator replied on the snippet above:

+1

@ShaneGZhu ShaneGZhu changed the title [Ops][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc [Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc May 11, 2026
@codecov-commenter codecov-commenter commented May 11, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@589a721). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...l_executor/layers/moe/fused_moe_cutlass_backend.py | 66.66% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 75.00% | 1 Missing and 1 partial ⚠️ |
| ...el_executor/layers/moe/fused_moe_triton_backend.py | 80.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7777   +/-   ##
==========================================
  Coverage           ?   63.15%           
==========================================
  Files              ?      461           
  Lines              ?    64138           
  Branches           ?     9824           
==========================================
  Hits               ?    40505           
  Misses             ?    20851           
  Partials           ?     2782           
| Flag | Coverage Δ |
|---|---|
| GPU | 72.27% <80.00%> (?) |
| XPU | 7.13% <8.00%> (?) |

Flags with carried forward coverage won't be shown.


PaddlePaddle-bot (comment marked as outdated)

gongshaotian (Collaborator) previously approved these changes May 12, 2026

@gongshaotian left a comment:

LGTM

yongqiangma (Collaborator) previously approved these changes May 12, 2026

@yongqiangma left a comment:

LGTM

…ontrol whether to use the kernel-fused path.
PaddlePaddle-bot (comment marked as outdated)

@PaddlePaddle-bot PaddlePaddle-bot left a comment:

🤖 Paddle-CI-Agent | pr_review | 2026-05-13 18:36:14

📋 Review summary

PR overview: adds a grouped_topk fused CUDA kernel that merges the four steps cast + sigmoid + bias + noaux_tc into a single kernel launch, replacing the previous fused_cast_sigmoid_bias + noaux_tc two-kernel path; the new flag enable_moe_scores_elementwise_fuse controls whether it is enabled.

Scope of change: custom_ops/gpu_ops/ (new CUDA kernel), fastdeploy/model_executor/layers/moe/ (three backends + moe.py), fastdeploy/engine/args_utils.py, fastdeploy/scheduler/config.py

Impact tags: [OP] [Optimization]


📝 PR convention check

The title [Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc has three issues: ① it contains two tags (the convention requires exactly one); ② the casing of [Op] does not match the official list ([OP]); ③ there is no space between the tag and the description.

Suggested title (copy-paste ready):

  • [OP] Kernel fusion: cast+sigmoid+bias+noauxtc

Issues

| Level | File | Summary |
|---|---|---|
| 📝 Convention | PR title | two tags, wrong [Op] casing, missing space after the tag |
| ❓ Question | fastdeploy/engine/args_utils.py:344 | the False default removes the existing fused_cast_sigmoid_bias optimization; existing deployments regress |
| 🟡 Suggestion | fastdeploy/model_executor/layers/moe/moe.py:135 | with use_fused_cast=True on the redundant-EP path, gating_output is passed to noaux_tc_redundant without being cast to float32 |
| ❓ Question | fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py:363 | the FD_ENABLE_RL guard was removed; behavior changes when the fuse flag is enabled in RL mode |

Overall assessment

The CUDA kernel implementation (BitonicSort / WarpSelect / Phase 1 & 2) is logically complete, all three backends are updated in step, and the tests cover four representative model configurations; overall quality is good. The main concerns are the default-value policy and type safety on boundary paths; recommend confirming these before merging.
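As background on the BitonicSort stage mentioned above (this is not the PR's code): a bitonic sorting network is attractive inside a warp because its compare-exchange schedule is data-independent, so every lane follows the same control flow with no divergence. A minimal CPU sketch of the network:

```python
def bitonic_sort(vals, ascending=True):
    """Sort a power-of-two-length sequence with a bitonic network.

    The schedule of (i, i^j) compare-exchanges below is fixed in
    advance, independent of the data; on a GPU each pair maps to a
    shuffle between lanes, which is why warp-level top-k kernels
    favor this pattern.
    """
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vals)
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # compare-exchange stride within a merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    up = (i & k) == 0   # direction of this subsequence
                    if (out[i] > out[partner]) == (up == ascending):
                        out[i], out[partner] = out[partner], out[i]
            j //= 2
        k *= 2
    return out
```

A top-k selection then just reads the last k (or first k) slots of the sorted sequence; real kernels prune the network to keep only k candidates per lane group.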

Chunk size of moe input.
"""

enable_moe_scores_elementwise_fuse: bool = False

❓ Question: the False default for enable_moe_scores_elementwise_fuse removes the previously default-on fused_cast_sigmoid_bias optimization.

Before this PR, non-RL CUDA deployments defaulted to fused_cast_sigmoid_bias + noaux_tc (roughly +2% TPS); after merging, the default falls back to plain Python sigmoid + noaux_tc (equivalent to the baseline).

Existing deployments will silently lose the previous performance gain without touching their configuration. Suggest documenting the migration in the flag's help text or the release notes, or evaluating enabling the flag by default on CUDA platforms.

renormalize,
routed_scaling_factor,
)
else:

🟡 Suggestion: when use_fused_cast=True and expert_id_to_ep_rank_array is not None (the redundant-EP path), the callers (cutlass/triton backends) do not pre-cast gate_out to float32 (the if not use_fused: gate_out = gate_out.cast("float32") step is skipped), yet this else branch applies sigmoid directly to a bfloat16/float16 gating_output and adds the float32 e_score_correction_bias, which may trigger an implicit cast or a dtype mismatch inside noaux_tc_redundant.

Suggest adding a float32 cast fallback at the entry of this else branch:

else:
    if gating_output.dtype != paddle.float32:
        gating_output = gating_output.cast("float32")
    scores = paddle.nn.functional.sigmoid(gating_output)
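The concern can be illustrated outside Paddle with NumPy (an illustrative sketch only, none of these names are from the PR): a sigmoid evaluated in half precision loses accuracy before the float32 bias is added, even though the sum is silently promoted to float32 afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
logits16 = rng.standard_normal(1024).astype(np.float16)
bias32 = rng.standard_normal(1024).astype(np.float32)

# sigmoid computed entirely in float16, then added to a float32 bias:
# the result is promoted to float32, but precision was already lost
# in the float16 intermediates.
half_path = (1.0 / (1.0 + np.exp(-logits16))) + bias32

# cast-first path, as the review suggests: sigmoid in float32.
full_path = (1.0 / (1.0 + np.exp(-logits16.astype(np.float32)))) + bias32
```

The two paths differ by up to the float16 rounding error of the sigmoid intermediates, which is exactly the kind of drift that breaks bitwise numerical-alignment tests between the fused and unfused routes.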

if fastdeploy.envs.FD_USE_PHI_MOE_PERMUTE and self.moe_quant_type == "w16a16":
if layer.topk_method == "noaux_tc":
use_fused = not fastdeploy.envs.FD_ENABLE_RL and current_platform.is_cuda() and not fc1_latent_proj
use_fused = (

❓ Question: the previous not fastdeploy.envs.FD_ENABLE_RL guard has been removed. If a user enables enable_moe_scores_elementwise_fuse=True while running in RL training mode (FD_ENABLE_RL=True), the fused kernel will now run in the RL scenario.

Please confirm: ① was the original RL guard intentional (i.e., the fused kernel has correctness or compatibility issues in RL mode), or a historical constraint that can now be dropped? ② If RL is incompatible with the fused path, restore the guard here or document the limitation.
